Learning Robot Manipulation from Audio World Models
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, when filling a bottle with water, visual information alone is ambiguous or incomplete, so the system must reason over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model that anticipates future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system on two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multimodal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns. Research in this domain has primarily concentrated on the following directions: 1) video-based models (Liang et al., 2025; Assran et al., 2025) that predict future visual frames from present observations, encoding the causal dependencies critical for physical interaction.
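The flow-matching idea behind this predictor can be sketched in a toy form. Everything here is an illustrative assumption, not the paper's architecture: latents are low-dimensional vectors, the "future" depends linearly on the present, and the learned velocity field is a plain linear map trained on the standard conditional flow-matching target (interpolate between noise and data, regress the velocity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 512

# Hypothetical setup: z_now encodes the current audio window,
# z_next the future window we want to anticipate.
W_true = 0.3 * rng.normal(size=(d, d))
z_now = rng.normal(size=(n, d))
z_next = z_now @ W_true.T

# Fixed evaluation set for the flow-matching objective.
x0_eval = rng.normal(size=(n, d))
t_eval = rng.uniform(size=(n, 1))

def fm_loss(A):
    xt = (1 - t_eval) * x0_eval + t_eval * z_next   # point on the interpolation path
    v_target = z_next - x0_eval                     # conditional flow-matching target
    feats = np.concatenate([xt, z_now, t_eval], axis=1)
    return np.mean((feats @ A.T - v_target) ** 2)

A = np.zeros((d, 2 * d + 1))   # linear stand-in for the learned velocity field
loss_before = fm_loss(A)
for _ in range(2000):
    i = rng.integers(0, n, size=64)
    x1, x0 = z_next[i], rng.normal(size=(64, d))
    t = rng.uniform(size=(64, 1))
    xt = (1 - t) * x0 + t * x1
    feats = np.concatenate([xt, z_now[i], t], axis=1)
    grad = (feats @ A.T - (x1 - x0)).T @ feats / 64   # MSE gradient step
    A -= 0.05 * grad
loss_after = fm_loss(A)
```

At inference one would integrate dx/dt = v(x, t, z_now) from noise at t=0 to a predicted future latent at t=1; the sketch stops at showing that the velocity-regression objective is learnable.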
Invariance Co-training for Robot Visual Generalization
Yang, Jonathan, Finn, Chelsea, Sadigh, Dorsa
Abstract-- Reasoning from diverse observations is a fundamental capability for generalist robot policies to operate in a wide range of environments. Despite recent advancements, many large-scale robotic policies remain sensitive to key sources of observational variation--such as changes in camera perspective, lighting, and the presence of distractor objects. We posit that the limited generalizability of these models arises from the substantial diversity required to robustly cover these quasistatic axes, coupled with the current scarcity of large-scale robotic datasets that exhibit rich variation across them. In this work, we propose to systematically examine what robots need to generalize across these challenging axes by introducing two key auxiliary tasks--state similarity and invariance to observational perturbations--applied to both demonstration data and static visual data. We then show that via these auxiliary tasks, leveraging both more-expensive robotic demonstration data and less-expensive, visually rich synthetic images generated from non-physics-based simulation (e.g., Unreal Engine) can lead to substantial increases in generalization to unseen camera viewpoints, lighting configurations, and distractor conditions. Our results demonstrate that co-training on this diverse data improves performance by 18% over existing generative augmentation methods. Robotic foundation models have shown impressive progress in generalizing to everyday scenarios by leveraging large-scale datasets spanning multiple embodiments, environments, and tasks [1], [2]. However, despite their breadth, the resulting models often remain brittle in real-world settings--failing to handle unseen spatial configurations of objects or adapt to drastic visual changes such as lighting and viewpoint shifts. We hypothesize that the brittleness of current robotic policies stems from insufficient coverage of key observational factors during training.
For example, many large-scale datasets provide only one or two third-person perspectives per scene, limiting robustness to viewpoint shifts.
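The two auxiliary objectives can be illustrated with toy stand-ins. The linear "encoder", the brightness-jitter perturbation, and the margin formulation of state similarity are all assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_obs, d_emb, n = 16, 4, 256

# Illustrative stand-ins: a linear "encoder" and crude perturbations that
# play the role of lighting / viewpoint / distractor changes.
W = rng.normal(size=(d_emb, d_obs)) / np.sqrt(d_obs)
obs = rng.normal(size=(n, d_obs))   # pretend these are frames along a trajectory

def perturb(x):
    return x * rng.uniform(0.8, 1.2) + 0.05 * rng.normal(size=x.shape)

def encode(x):
    return x @ W.T

# Auxiliary task 1: invariance -- embeddings of two perturbed views of the
# same state should match.
z_a, z_b = encode(perturb(obs)), encode(perturb(obs))
invariance_loss = np.mean((z_a - z_b) ** 2)

# Auxiliary task 2: state similarity -- states close in time should embed
# closer than distant ones (one possible margin formulation).
z = encode(obs)
near = np.mean(np.sum((z[:-1] - z[1:]) ** 2, axis=1))    # consecutive states
far = np.mean(np.sum((z[:-10] - z[10:]) ** 2, axis=1))   # 10 steps apart
similarity_loss = np.maximum(0.0, near - far + 1.0)
```

In co-training, losses of this shape would be added to the imitation objective on both robot demonstrations and the cheaper synthetic images.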
Teaching robot policies without new demonstrations: interview with Jiahui Zhang and Jesse Zhang
In their paper ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations, presented at CoRL 2025, the authors introduce a framework for learning robot manipulation tasks solely from language instructions, without per-task demonstrations. We asked Jiahui Zhang and Jesse Zhang to tell us more. What is the topic of the research in your paper, and what problem were you aiming to solve? Our research addresses the problem of enabling robot manipulation policies to solve novel, language-conditioned tasks without collecting new demonstrations for each task. We begin with a small set of demonstrations in the deployment environment, train a language-conditioned reward model on them, and then use that learned reward function to fine-tune the policy on unseen tasks, with no additional demonstrations required.
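The core loop, fine-tuning a policy from a learned reward alone, can be sketched in one dimension. The quadratic "reward model" and the Gaussian policy updated by REINFORCE are hypothetical stand-ins, not ReWiND's actual components:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in: a "learned reward model" scores how well a 1-D action matches
# a language instruction, here reduced to a target value.
def learned_reward(action, target):
    return -(action - target) ** 2

# Gaussian policy with a learnable mean, fine-tuned by REINFORCE on the
# learned reward -- no new demonstrations are used.
mu, sigma, lr, target = 0.0, 0.5, 0.05, 1.0
for _ in range(500):
    a = mu + sigma * rng.normal(size=64)          # sample a batch of actions
    r = learned_reward(a, target)
    r = r - r.mean()                              # baseline for variance reduction
    mu += lr * np.mean(r * (a - mu) / sigma**2)   # score-function gradient step
```

The policy mean drifts toward the instruction's target purely by following the reward model's gradient signal, which is the mechanism that lets the method skip per-task demonstrations.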
RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
Atreya, Pranav, Pertsch, Karl, Lee, Tony, Kim, Moo Jin, Jain, Arhan, Kuramshin, Artur, Eppner, Clemens, Neary, Cyrus, Hu, Edward, Ramos, Fabio, Tremblay, Jonathan, Arora, Kanav, Ellis, Kirsty, Macesanu, Luca, Villasevil, Marcel Torne, Leonard, Matthew, Cho, Meedeum, Aslan, Ozgur, Dass, Shivin, Wang, Jie, Reger, William, Yuan, Xingfang, Yang, Xuning, Gupta, Abhishek, Jayaraman, Dinesh, Berseth, Glen, Daniilidis, Kostas, Martin-Martin, Roberto, Lee, Youngwoon, Liang, Percy, Finn, Chelsea, Levine, Sergey
Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
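Deriving a ranking from pairwise preference feedback is typically done with a Bradley-Terry-style model; the sketch below assumes that formulation (the paper's exact aggregation may differ) and fits scores to simulated double-blind A/B outcomes:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pairwise results: four policies with latent "true" strengths;
# each record (i, j, w) says policy i beat j (w=True) in a double-blind trial.
true_skill = np.array([2.0, 1.0, 0.5, 0.0])
pairs = []
for _ in range(600):
    i, j = rng.choice(4, size=2, replace=False)
    p = 1 / (1 + np.exp(-(true_skill[i] - true_skill[j])))
    pairs.append((i, j, rng.random() < p))

# Bradley-Terry scores via gradient ascent on the pairwise log-likelihood.
s = np.zeros(4)
for _ in range(300):
    g = np.zeros(4)
    for i, j, w in pairs:
        p = 1 / (1 + np.exp(-(s[i] - s[j])))
        g[i] += w - p
        g[j] -= w - p
    s += 0.01 * g
    s -= s.mean()   # scores are identifiable only up to a constant

ranking = np.argsort(-s)
```

Because each evaluator only supplies relative judgments, tasks and environments never need to be standardized across sites; only the pairwise outcomes are pooled.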
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Jangir, Yash, Zhang, Yidi, Yamazaki, Kashu, Zhang, Chenyu, Tu, Kuan-Hsun, Ke, Tsung-Wei, Ke, Lei, Bisk, Yonatan, Fragkiadaki, Katerina
The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
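The "systematic perturbation" protocol can be illustrated with a toy sweep. The perturbation axes, the policy, and the degradation model below are all invented for the sketch; the point is the structure of a controlled stress test over a grid of variations:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(9)

# Toy stress test: evaluate a "policy" under a grid of controlled
# perturbations (texture / placement stand-ins) and record success rates.
def toy_policy_success(texture_shift, placement_shift):
    # hypothetical: success probability degrades as perturbations grow
    p = 0.95 - 0.3 * texture_shift - 0.5 * placement_shift
    return rng.random() < p

axes = {"texture": [0.0, 0.5, 1.0], "placement": [0.0, 0.4, 0.8]}
results = {}
for tex, pl in product(axes["texture"], axes["placement"]):
    trials = [toy_policy_success(tex, pl) for _ in range(200)]
    results[(tex, pl)] = np.mean(trials)

nominal = results[(0.0, 0.0)]     # unperturbed digital twin
hardest = results[(1.0, 0.8)]     # strongest perturbation on both axes
```

Comparing the nominal cell against perturbed cells is what turns a single success number into a robustness profile along each axis.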
MemER: Scaling Up Memory for Robot Control via Experience Retrieval
Sridhar, Ajay, Pan, Jennifer, Sharma, Satvik, Finn, Chelsea
Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/. In recent times, we have seen significant strides in the language-following and generalization capabilities of robotic manipulation policies (Brohan et al., 2023; Intelligence et al., 2025; Kim et al., 2024; NVIDIA et al., 2025; Team et al., 2025). While these policies are improving for real-world deployment, a critical limitation remains: the absence of long-term memory. Memory allows humans to handle the inherent partial observability found in their environment. For instance, if a person wanted to make a sandwich, they would have to recall where they saw the jar of peanut butter or the knife, especially if these items had not been recently viewed. The ability to form and retrieve memories is a crucial step towards robots solving complex, multi-step tasks.
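The retrieval step at the heart of this design can be sketched with embedding similarity. The per-frame embeddings, the cosine top-k rule, and the fixed recent-frame window are illustrative assumptions, not MemER's trained keyframe selector:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch: pick the k most task-relevant keyframes from a long observation
# history, then condition the low-level policy on [keyframes + recent frames]
# instead of the full history.
T, d, k, recent = 500, 32, 3, 5
history = rng.normal(size=(T, d))                 # per-frame embeddings (stand-in)
query = history[42] + 0.01 * rng.normal(size=d)   # e.g., "where was the peanut butter?"

def top_k_keyframes(history, query, k):
    h = history / np.linalg.norm(history, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(h @ q))[:k]   # indices of the most similar frames

keyframes = top_k_keyframes(history, query, k)
context = np.concatenate([history[keyframes], history[-recent:]])
```

The compact `context` (8 frames here instead of 500) is what keeps inference cost flat as the episode grows to minutes of history.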
WorldGym: World Model as An Environment for Policy Evaluation
Quevedo, Julian, Sharma, Ansh Kumar, Sun, Yixiang, Suryavanshi, Varad, Liang, Percy, Yang, Sherry
Robots can help humans in many ways, for example through home robots performing chores (Shafiullah et al.). With the development of generative models trained on large-scale video data (Ho et al., 2022; Villegas et al., 2022; Singer et al., 2022), recent work has shown that video world models can visually emulate real-world environments. Inspired by this observation, we propose a world-model-based policy evaluation environment (WorldGym), as shown in Figure 1. To enable efficient rollouts of policies that predict action chunks of different lengths, WorldGym flexibly aligns its diffusion horizon length with the policies' chunk sizes at inference time. Given video rollouts from the world model, WorldGym then uses a vision-language model (VLM) to determine task success. We then evaluate VLA-based robot policies by rolling them out in the world model starting from real initial frames, and compare their success rates (policy values) in WorldGym to those achieved in real-world experiments. We formalize the setting as a multi-task, finite-horizon, partially observable Markov decision process (POMDP) (Puterman, 2014; Kaelbling et al., 1995). In this section, we first describe our implementation of world model training and inference. See videos and code at https://world-model-eval.github.io
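The evaluation loop, policy chunks rolled through a world model whose horizon matches the chunk size, with a judge scoring the final frames, can be sketched with scalar stand-ins. The 1-D "world model", the hand-written policy, and the threshold "success detector" (playing the VLM's role) are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def policy(state, chunk=4):
    # toy policy: emit a chunk of actions that drives the state toward 1.0
    step = np.clip(1.0 - state, -0.8, 0.8) / chunk
    return np.full(chunk, step)

def world_model(state, actions):
    # horizon = len(actions) = the policy's chunk size, as in the alignment idea
    states = []
    for a in actions:
        state = state + a + 0.01 * rng.normal()   # noisy learned dynamics stand-in
        states.append(state)
    return states

def success(states, goal=1.0, tol=0.1):
    return abs(states[-1] - goal) < tol   # stand-in for the VLM success judgment

state, frames = 0.0, []
for _ in range(10):          # 10 chunks of 4 actions each
    chunk = policy(state)
    frames += world_model(state, chunk)
    state = frames[-1]

succeeded = success(frames)
```

Starting each rollout from a real initial frame and comparing `succeeded` rates against real-robot outcomes is what lets the world model stand in for physical evaluation.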
SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration
Jin, Yang, Lv, Jun, Xue, Han, Chen, Wendi, Wen, Chuan, Lu, Cewu
Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.
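The contrast between on-manifold and raw action-space noise can be made concrete with PCA as a stand-in for the learned latent model (SOE's actual representation is learned, not PCA; the 2-D plane and noise scales here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(6)

# Demonstrated actions lie near a low-dimensional manifold: here a 2-D plane
# inside an 8-D action space, recovered by PCA as a stand-in for the latent model.
d, k, n = 8, 2, 400
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
actions = rng.normal(size=(n, k)) @ basis.T + 0.01 * rng.normal(size=(n, d))

mean = actions.mean(axis=0)
U = np.linalg.svd(actions - mean, full_matrices=False)[2][:k]  # principal axes

def on_manifold_explore(a, scale=0.3):
    z = (a - mean) @ U.T                 # encode action into the latent space
    z = z + scale * rng.normal(size=k)   # perturb only along valid variation
    return mean + z @ U                  # decode back to the action space

def off_manifold_distance(a):
    recon = mean + (a - mean) @ U.T @ U
    return np.linalg.norm(a - recon)

a0 = actions[0]
safe = off_manifold_distance(on_manifold_explore(a0))       # stays on the plane
naive = off_manifold_distance(a0 + 0.3 * rng.normal(size=d))  # raw action noise
```

Latent-space perturbations land back on the demonstrated-action manifold by construction, while equally sized raw noise mostly leaves it, which is the safety argument in miniature.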
Fine-Tuning Robot Policies While Maintaining User Privacy
Christie, Benjamin A., Parekh, Sagar, Losey, Dylan P.
Recent works introduce general-purpose robot policies. These policies provide a strong prior over how robots should behave -- e.g., how a robot arm should manipulate food items. But in order for robots to match an individual person's needs, users typically fine-tune these generalized policies -- e.g., showing the robot arm how to make their own preferred dinners. Importantly, during the process of personalizing robots, end-users leak data about their preferences, habits, and styles (e.g., the foods they prefer to eat). Other agents can simply roll-out the fine-tuned policy and see these personally-trained behaviors. This leads to a fundamental challenge: how can we develop robots that personalize actions while keeping learning private from external agents? We here explore this emerging topic in human-robot interaction and develop PRoP, a model-agnostic framework for personalized and private robot policies. Our core idea is to equip each user with a unique key; this key is then used to mathematically transform the weights of the robot's network. With the correct key, the robot's policy switches to match that user's preferences -- but with incorrect keys, the robot reverts to its baseline behaviors. We show the general applicability of our method across multiple model types in imitation learning, reinforcement learning, and classification tasks. PRoP is practically advantageous because it retains the architecture and behaviors of the original policy, and experimentally outperforms existing encoder-based approaches. See videos and code here: https://prop-icra26.github.io.
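One way to picture a key-based weight transform is below. This is an illustrative scheme, not PRoP's actual transform: the personalization delta is stored permuted under a secret key, and note that here a wrong key yields a scrambled (non-personal) delta rather than the paper's clean fallback to baseline behavior:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: a baseline policy weight matrix and a user's
# fine-tuned personalization delta.
W_base = rng.normal(size=(4, 4))
delta = 0.5 * rng.normal(size=(4, 4))

def perm(key, n=16):
    # the key deterministically seeds a permutation of the flattened delta
    return np.random.default_rng(key).permutation(n)

key = 12345
stored = delta.ravel()[perm(key)]   # scrambled delta kept on the robot

def policy_weights(candidate_key):
    unscrambled = np.empty(16)
    unscrambled[perm(candidate_key)] = stored   # invert the permutation
    return W_base + unscrambled.reshape(4, 4)

right = policy_weights(12345)   # correct key -> personalized weights
wrong = policy_weights(99999)   # wrong key -> garbled, non-personal weights
```

The appeal of weight-level transforms is that the architecture is untouched, so the scheme stays model-agnostic across imitation, RL, and classification policies.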
MotionTrans: Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies
Yuan, Chengbo, Zhou, Rui, Liu, Mengzhen, Hu, Yingdong, Wang, Shengjie, Yi, Li, Wen, Chuan, Zhang, Shanghang, Gao, Yang
Scaling real robot data is a key bottleneck in imitation learning, leading to the use of auxiliary data for policy training. While other aspects of robotic manipulation such as image or language understanding may be learned from internet-based datasets, acquiring motion knowledge remains challenging. Human data, with its rich diversity of manipulation behaviors, offers a valuable resource for this purpose. While previous works show that using human data can bring benefits, such as improving robustness and training efficiency, it remains unclear whether it can realize its greatest advantage: enabling robot policies to directly learn new motions for task completion. In this paper, we systematically explore this potential through multi-task human-robot cotraining. We introduce MotionTrans, a framework that includes a data collection system, a human data transformation pipeline, and a weighted cotraining strategy. By cotraining 30 human-robot tasks simultaneously, we directly transfer motions of 13 tasks from human data to deployable end-to-end robot policies. Notably, 9 tasks achieve non-trivial success rates in a zero-shot manner. MotionTrans also significantly enhances pretraining-finetuning performance (+40% success rate). Through ablation studies, we also identify key factors for successful motion learning: cotraining with robot data and broad task-related motion coverage. These findings unlock the potential of motion-level learning from human data, offering insights into its effective use for training robotic manipulation policies. All data, code, and model weights are open-sourced at https://motiontrans.github.io/.
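The weighted cotraining strategy reduces, at its simplest, to a sampling ratio between the two data sources. The weight value and dataset sizes below are invented for the sketch (the paper tunes this trade-off):

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative pools: a small robot dataset and a larger pool of
# (already transformed) human VR demonstrations.
robot_data = [("robot", i) for i in range(200)]
human_data = [("human", i) for i in range(800)]
w_robot = 0.5   # hypothetical sampling weight for robot data

def sample_batch(size=32):
    batch = []
    for _ in range(size):
        pool = robot_data if rng.random() < w_robot else human_data
        batch.append(pool[rng.integers(len(pool))])
    return batch

batches = [sample_batch() for _ in range(100)]
frac_robot = np.mean([sum(x[0] == "robot" for x in b) / 32 for b in batches])
```

Weighting at the batch level rather than concatenating the datasets keeps the scarce robot data from being drowned out by the much larger human pool, which matches the ablation finding that cotraining with robot data is a key factor.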